Introduction

Materials and methods

Materials

We work with a breast cancer dataset obtained from the Kaggle website.

This is the dataset we are working with:

##         patient_id          education     id_healthcenter id_treatment_region
##  111035895969:  1   Diploma      :265   1110000154: 17    1110000329:321     
##  111035896483:  1   Elementary   :150   1110000280: 13    1110000330:305     
##  111035897677:  1   Middle School:122   1110000303: 13    1110000331:213     
##  111035897739:  1   Bachelor     : 91   1110000305: 11                       
##  111035897959:  1   Illiterate   : 89   1110000181: 10                       
##  111035898042:  1   High School  : 65   1110000225: 10                       
##  (Other)     :833   (Other)      : 57   (Other)   :765                       
##  hereditary_history   birth_date        age            weight     
##  0:359              Min.   :1939   Min.   : 1.00   Min.   :  6.0  
##  1:480              1st Qu.:1979   1st Qu.:28.00   1st Qu.: 69.0  
##                     Median :1986   Median :33.00   Median : 78.0  
##                     Mean   :1984   Mean   :35.14   Mean   : 75.1  
##                     3rd Qu.:1991   3rd Qu.:40.00   3rd Qu.: 86.0  
##                     Max.   :2018   Max.   :80.00   Max.   :101.0  
##                     NA's   :2                                     
##  thickness_tumor  marital_status        marital_length pregnency_experience
##  Min.   :0.0100   0:201          above 10 years:446    0:205               
##  1st Qu.:0.4000   1:638          under 10 years:393    1:634               
##  Median :0.6000                                                            
##  Mean   :0.5747                                                            
##  3rd Qu.:0.8000                                                            
##  Max.   :1.3000                                                            
##                                                                            
##   giving_birth age_FirstGivingBirth abortion     blood     taking_heartMedicine
##  1      :400   above 30:466         0:686    A+     :199   0:317               
##  0      :198   under 30:373         1:153    A-     :139   1:522               
##  2      :131                                 AB+    :136                       
##  3      : 79                                 B+     :122                       
##  4      : 14                                 AB-    : 86                       
##  5      : 12                                 (Other):156                       
##  (Other):  5                                 NA's   :  1                       
##  taking_blood_pressure_medicine taking_gallbladder_disease_medicine smoking
##  0:249                          0:385                               0:572  
##  1:590                          1:454                               1:267  
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##  alcohol breast_pain radiation_history Birth_control  menstrual_age
##  0:531   0:323       0:418             0:312         above 12:344  
##  1:308   1:516       1:421             1:527         not yet : 25  
##                                                      under 12:470  
##                                                                    
##                                                                    
##                                                                    
##                                                                    
##   menopausal_age Benign_malignant_cancer           condition   treatment_age  
##  above 50: 37    Benign   :335           death          :424   Min.   : 1.00  
##  not yet :744    Malignant:504           recovered      :144   1st Qu.:28.00  
##  under 50: 56                            under treatment:271   Median :33.00  
##  NA's    :  2                                                  Mean   :35.16  
##                                                                3rd Qu.:40.00  
##                                                                Max.   :80.00  
##                                                                NA's   :2

Cleaning the data

Before                            After
Columns of different types        All columns read as doubles
0, 1, 2 values                    Boolean variables
Names with \r\n                   Clean names
Birth date with 3 characters      Birth date with 4 characters
Blood type 44                     Correct blood types only
Odd weight/age correlations       People under 18 years old removed
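
Some of the cleaning steps listed above can be sketched roughly as follows. This is an illustrative Python sketch, not the code used for the analysis: the helpers `clean_name` and `clean_record`, the sample rows, and the set of accepted blood types are all assumptions made for the example.

```python
# Illustrative sketch of some cleaning steps; not the original pipeline.
# VALID_BLOOD_TYPES, BINARY_COLUMNS and the sample rows are assumptions.
VALID_BLOOD_TYPES = {"A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"}
BINARY_COLUMNS = ("smoking", "alcohol")  # a subset of the 0/1 columns


def clean_name(name):
    """Strip stray carriage returns, newlines and surrounding spaces."""
    return name.replace("\r", "").replace("\n", "").strip()


def clean_record(raw):
    """Return a cleaned copy of one row, or None if the row is dropped."""
    row = {clean_name(key): value for key, value in raw.items()}

    # Drop rows whose blood type is not a recognised value.
    if row.get("blood") not in VALID_BLOOD_TYPES:
        return None

    # Keep adults only, to avoid implausible weight/age combinations.
    if row["age"] < 18:
        return None

    # Turn 0/1 indicator columns into booleans.
    for column in BINARY_COLUMNS:
        row[column] = bool(int(row[column]))
    return row


# Made-up rows: one valid adult, one minor, one invalid blood type.
raw_rows = [
    {"age\r\n": 35, "blood": "A+", "smoking": "0", "alcohol": "1"},
    {"age\r\n": 12, "blood": "B+", "smoking": "0", "alcohol": "0"},
    {"age\r\n": 40, "blood": "44", "smoking": "1", "alcohol": "0"},
]
cleaned = [row for row in map(clean_record, raw_rows) if row is not None]
print(cleaned)
```

Only the first made-up row survives: the second is under 18 and the third has an unrecognised blood type.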

For the statistical analysis, we kept only women.

Augmenting the data

  • We have added more informative columns
  • We have changed the type of the columns

Statistical analysis

We created plots to explore the data and ran statistical analyses, including multiple correspondence analysis (MCA). The plots are shown in the Results section.
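
MCA starts from an indicator (one-hot) matrix in which every category of every categorical variable becomes its own 0/1 column. Below is a minimal Python sketch of that first step only, using made-up rows rather than the real data. The helper `indicator_matrix` is hypothetical, and the decomposition that MCA then applies to this matrix is not shown.

```python
def indicator_matrix(rows, columns):
    """One-hot encode the given categorical columns.

    Returns (labels, matrix): labels are (column, category) pairs and
    matrix holds one list of 0/1 values per input row.
    """
    labels = [
        (column, category)
        for column in columns
        for category in sorted({row[column] for row in rows})
    ]
    matrix = [
        [1 if row[column] == category else 0 for column, category in labels]
        for row in rows
    ]
    return labels, matrix


# Made-up rows for illustration only.
rows = [
    {"smoking": "0", "condition": "recovered"},
    {"smoking": "1", "condition": "death"},
    {"smoking": "0", "condition": "under treatment"},
]
labels, matrix = indicator_matrix(rows, ["smoking", "condition"])
for label, column_values in zip(labels, zip(*matrix)):
    print(label, column_values)
```

Each row of the matrix contains exactly one 1 per original variable, so every row sums to the number of encoded variables.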

Results

Plots of the data

  • Cultural variables (education, marital_status) are not actually needed
  • Health-related variables (medicines, harmful habits) are very common among breast cancer patients
  • Early menstrual periods before age 12 and starting menopause after age 55 expose women to hormones longer, raising their risk of getting breast cancer

Plots

  • In the data, most patients are quite young (under 40)
  • The treatment age is consistent with the age of the patients
  • Tumor thickness is the variable that varies the most

  • In all cases, the recovered fraction is the lowest one.
  • Patients with no radiation history recovered better than those who had received radiation.
  • In most cases, recovery is better among patients who took medicine.
  • In most cases, deaths are also higher among patients who took medicine (which seems contradictory).
  • Not drinking alcohol or smoking improves recovery.
  • Deaths are lower among patients who drink alcohol and smoke (which does not make sense).
  • These are absolute counts; relative values (proportions within each group) may be more informative.
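
The last point can be illustrated with a short Python sketch that turns absolute counts into within-group shares. The records below are invented for the example and the helper `outcome_proportions` is hypothetical; this is not the computation used for the plots.

```python
from collections import Counter


def outcome_proportions(records, group_key, outcome_key="condition"):
    """Share of each outcome within each group (shares per group sum to 1)."""
    counts = Counter((r[group_key], r[outcome_key]) for r in records)
    totals = Counter(r[group_key] for r in records)
    return {
        (group, outcome): n / totals[group]
        for (group, outcome), n in counts.items()
    }


# Made-up records: 4 non-smokers and 2 smokers.
records = [
    {"smoking": 0, "condition": "recovered"},
    {"smoking": 0, "condition": "death"},
    {"smoking": 0, "condition": "death"},
    {"smoking": 0, "condition": "under treatment"},
    {"smoking": 1, "condition": "death"},
    {"smoking": 1, "condition": "recovered"},
]
shares = outcome_proportions(records, "smoking")
for (group, outcome), share in sorted(shares.items()):
    print(f"smoking={group} {outcome}: {share:.2f}")
```

In this toy example the non-smokers have twice as many deaths in absolute terms, yet the share of deaths is the same in both groups. Unequal group sizes can produce this kind of effect, which is one possible reason for the counterintuitive observations above.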

  • Above age 40, the chances of recovering from breast cancer are higher
  • The weight is higher in recovered patients
  • One group of patients who died had a very low weight (around 25 kg)
  • Another group of patients who died weighed around 75 kg
  • Tumor thickness is greater in patients who died

  • As tumor thickness increases, medical treatment has more impact

  • Birth date is almost perfectly correlated with treatment age, with a negative sign
  • So is age, but with a positive sign
  • There is no clear relationship between tumor thickness and the other variables
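
A perfect negative correlation is what one would expect whenever an age is computed as a reference year minus the birth year, because that is a decreasing linear function. The Python sketch below shows this with invented birth years; the reference year 2019 and the helper `pearson` are assumptions for illustration, not values or code taken from the analysis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


REFERENCE_YEAR = 2019  # assumed year in which ages were computed
birth_years = [1950, 1979, 1986, 1991, 2000]  # made-up values
ages = [REFERENCE_YEAR - year for year in birth_years]

r = pearson(birth_years, ages)
print(f"correlation between birth year and age: {r:.3f}")
```

The coefficient is -1 up to floating-point rounding, so such a pair of columns carries the same information and one of them could be dropped.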

  • Over the years, patients are treated earlier
  • Over the years, education levels become more diverse
  • The age at first giving birth is more or less consistent over the years
  • As the age of treatment decreases, tumor thickness becomes more varied

Discussion

Conclusion

We have reached the following conclusions

THANK YOU
FOR YOUR ATTENTION